Gene-level differential expression analysis using DESeq2

Coral Host

DESeq2 package

The package DESeq2 provides methods to test for differential expression by use of negative binomial generalized linear models; the estimates of dispersion and logarithmic fold changes incorporate data-driven prior distributions.

More information on the program and specific functions: https://www.bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html

Import libraries

library(RColorBrewer)
library(circlize)
library(pheatmap)
library(gplots) #heatmap.2()
library(ComplexHeatmap)

library(tidyverse)
library(vegan)
library(cowplot)
library(ggrepel)
library(devtools)
library(vsn)
library(dplyr)
library(plyr)
library(ggplot2)
library(knitr)
library(gridExtra)
library(data.table)

# first time installation:
# if (!requireNamespace("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")
# BiocManager::install("DESeq2")
# BiocManager::install("DEGreport")
# BiocManager::install("arrayQualityMetrics")
library(BiocManager)
library(DESeq2)
library(DEGreport)
library(arrayQualityMetrics)
library(locfit)

Set working directory

setwd("~/Documents/Rprojects/postdoc\ Rprojects/ODU_postdoc_Rprojects/Paper_Ojo_gene-expression/Porites_astreoides/host/")

Counts data

As input, the DESeq2 package expects count data as obtained, e.g., from RNA-seq in the form of a matrix of integer values. The value in the i-th row and the j-th column of the matrix tells how many reads can be assigned to gene i in sample j.

The values in the matrix should be un-normalized counts or estimated counts of sequencing reads (for single-end RNA-seq) or fragments (for paired-end RNA-seq).

The DESeq2 model internally corrects for library size, so transformed or normalized values such as counts scaled by library size should not be used as input.

Import raw data

host <- read.csv("Genes_Past_OA_ojo.csv", header=TRUE, row.names=1)
dim(host)  
## [1] 29422    40
## Some genes were not detected in all samples (genes with zero reads)
host <- host[rowSums(host) > 0, ]
dim(host)  
## [1] 29201    40

Metadata (conditions)

NB! It is absolutely critical that the columns of the count matrix and the rows of the column data (information about samples) are in the same order. There should be the same number of conditions described as there are samples in your data file, and in the same order.

The rownames of meta have to match the column names of the counts file

meta <- read.csv("~/Documents/Rprojects/postdoc\ Rprojects/ODU_postdoc_Rprojects/Paper_Ojo_gene-expression/Porites_astreoides/P_astreoides_Year-2_metadata.csv", header=TRUE, row.names=1)

# Remove all samples that were transplanted to the Destination Ojo.Norte
# There is no control site for Norte Ojo
# For now, we are focused on samples transplanted to Laja Control or Laja Ojo sites
# This is consistent with analyses for the other 2 species in the experiment
# Also, group sample size is too small for Ojo.Norte destination

meta <- subset(meta, Destination_name != "Ojo.Norte")
meta$Origin_type <- revalue(meta$Origin_type, c("control" = "Control", "ojo" = "Ojo"))
meta$Destination_type <- revalue(meta$Destination_type, c("control" = "Control", "ojo" = "Ojo"))

meta$Origin_type <- revalue(meta$Origin_type, c("Control" = "Lagoon"))

meta <- meta %>% unite(group, Origin_type, Destination_type, sep = ".", remove = FALSE)


meta[sapply(meta, is.character)] <- lapply(meta[sapply(meta, is.character)], 
                                       as.factor)
# Colony_ID is equivalent to genotype
meta$Colony_ID <- as.factor(meta$Colony_ID)

levels(meta$group)
## [1] "Lagoon.Control" "Lagoon.Ojo"     "Ojo.Control"    "Ojo.Ojo"

select relevant metadata

meta <- dplyr::select(meta, Colony_ID, group, Origin_name, Origin_type, Destination_name, Destination_type, pH_Destination)

Descriptive statistics

## The do.call() function produces a data frame with one column per sample;
## transpose it to obtain one row per sample and one column per statistic.
stats.per.sample <- data.frame(t(do.call(cbind, lapply(host, summary))))

stats.per.sample$libsum <- apply(host, 2, sum) ## libsum
stats.per.sample$zeros <- apply(host==0, 2, sum)
stats.per.sample$percent.zeros <- 100*stats.per.sample$zeros/nrow(host)

head(stats.per.sample)
##                 Min. X1st.Qu. Median     Mean X3rd.Qu.    Max.   libsum zeros
## NC_291_C_Pa_yr2    0        0      1 123.8224        3 1456596  3615737 11584
## NC_294_C_Pa_yr2    0        1      3 214.4352        7 2465445  6261721  6291
## NC_297_C_Pa_yr2    0        0      2 232.1235        5 2309389  6778237  8408
## NC_300_C_Pa_yr2    0        1      3 439.8108        7 4476840 12842915  6613
## NC_313_C_Pa_yr2    0        0      1 165.3343        3 1777139  4827927 10120
## NC_316_C_Pa_yr2    0        0      2 228.7311        5 2247513  6679176  8048
##                 percent.zeros
## NC_291_C_Pa_yr2      39.66987
## NC_294_C_Pa_yr2      21.54378
## NC_297_C_Pa_yr2      28.79353
## NC_300_C_Pa_yr2      22.64648
## NC_313_C_Pa_yr2      34.65635
## NC_316_C_Pa_yr2      27.56070

Distributions

Histograms of counts per gene

  • Top: raw counts. The scale is determined by the gene with the highest count, which is apparently an outlier.
  • Middle: raw counts, with the X axis truncated to 500 in order to display a representative range despite outliers.
  • Bottom: log2-transformed counts per gene, with a pseudocount of 1 to avoid minus-infinite values resulting from zero counts.
par(mfrow=c(3,1))

hist(as.matrix(host), col="blue", border="white", breaks=50)

hist(as.matrix(host), col="blue", border="white",
     breaks=20000, xlim=c(0,500), main="Counts per gene",
     xlab="Counts (truncated axis)", ylab="Number of genes", 
     las=1, cex.axis=0.7)

epsilon <- 1 # pseudo-count to avoid problems with log(0)
hist(as.matrix(log2(host + epsilon)), breaks=100, col="blue", border="white",
     main="Log2-transformed counts per gene", xlab="log2(counts+1)", ylab="Number of genes",
     las=1, cex.axis=0.7)

Boxplots of gene count distributions per sample (non-normalized log2(counts))

boxplot(log2(host + epsilon), col=meta$group, pch=".", 
        horizontal=TRUE, cex.axis=0.5, las=1, 
        xlab="log2(Counts +1)")

Density plots (log2(counts))

if(!require("affy")){
  BiocManager::install("affy")
}
library(affy)

## Each curve corresponds to one sample 
plotDensity(log2(host + epsilon), lty=1, col= meta$group, lwd=2)
grid()

NB! the R function plotDensity() does not display the actual distribution of your values, but a smoothed density estimate. The representation thus generally looks smoother than the actual data. It is important to realize that, in some particular cases, the smoothing can produce extrapolated values which can be misleading.

Filtering

Filtering out low-expression genes is common practice in RNA-seq analysis, for several reasons. From a biological point of view, genes that are not expressed at a biologically meaningful level in any condition are not of interest for the detection of differentially expressed genes (DEGs) and are therefore best ignored. Filtering also reduces the multiple-testing burden, which can increase the number of DEGs detected after correction and improve both the sensitivity and the precision of DEG detection (Bourgon et al., 2010).

In addition, genes/transcripts with low read counts are generally considered artifacts or ‘noise’. From a statistical point of view, removing low-count genes allows the mean-variance relationship in the data to be estimated more reliably (Law et al., 2016).

https://seqqc.wordpress.com/2020/02/17/removing-low-count-genes-for-rna-seq-downstream-analysis/

# how many genes have mean count >=3 
host_means_filtered <- cbind(host, means = apply(host, 1, mean))
table(host_means_filtered$means>=3)
## 
## FALSE  TRUE 
## 21083  8118
host_means_filtered <- subset(host_means_filtered, host_means_filtered$means>=3)
host_means_filtered <- host_means_filtered[, 1:40]
dim(host_means_filtered)
## [1] 8118   40

Although RNA-seq technology has improved the dynamic range of gene expression quantification, low-expression genes may be indistinguishable from sampling noise. The presence of noisy, low-expression genes can decrease the sensitivity of detecting DEGs. Thus, identification and filtering of these low-expression genes may improve DEG detection sensitivity.

Genes with very low counts across all samples provide little evidence for differential expression and they interfere with some of the statistical approximations that are used later in the pipeline. They also add to the multiple testing burden when estimating false discovery rates, reducing power to detect differentially expressed genes. These genes should be filtered out prior to further analysis.

For RNA-seq data summarized to counts, one suggestion is to filter on absolute counts, since genes with low observed counts are probably not really expressed, or their expression cannot be reliably measured. Furthermore, low-count data (which usually also contain many zeros) are not really suitable for correlation analysis. This simple approach works fine for bulk (tissue) sequencing data, but be very careful about applying it to single-cell sequencing, where many interesting genes may have zero counts in most cells and relatively low counts in the rest.
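An alternative to the mean-based filter used below is to keep genes with a minimum absolute count in a minimum number of samples. This sketch assumes a threshold of 10 counts and a smallest group size of 5; both are illustrative values, not values taken from this analysis:

```r
# Sketch of a count-based filter (assumed thresholds, not the filter used below):
# keep genes with >= 10 raw counts in at least `min_n` samples
min_n <- 5                            # assumption: smallest group size
keep  <- rowSums(host >= 10) >= min_n
table(keep)                           # how many genes pass the filter
host_filtered_alt <- host[keep, ]
```

The mean-based filter applied in the next chunk is a looser criterion, since a high count in a few samples can pull the mean above the threshold.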

Boxplots of gene count distributions per sample

boxplot(log2(host_means_filtered + epsilon), col=meta$group, pch=".", 
        horizontal=TRUE, cex.axis=0.5, las=1, 
        xlab="log2(Counts +1)")

Remove Destination Ojo.Norte samples

drop <- c("NC_318_N_Pa_yr2", "NO_278_N_Pa_yr2", "NO_266_N_Pa_yr2", "NC_327_N_Pa_yr2", "NO_275_N_Pa_yr2", "NC_293_N_Pa_yr2", "NO_272_N_Pa_yr2", "NC_312_N_Pa_yr2")

host_means_filtered <- host_means_filtered[,!(names(host_means_filtered) %in% drop)]
meta <- meta[!(row.names(meta) %in% drop),]

host_means_filtered <- droplevels(host_means_filtered)
meta <- droplevels(meta)

Percentage of null counts

prop.null <- apply(host_means_filtered, 2, function(x) 100*mean(x==0))

barplot(prop.null, main="Percentage of null counts per sample", 
        horiz=TRUE, cex.names=0.5, las=1, 
        col=meta$group, ylab='Samples', xlab='% of null counts')

Note: 3 samples have >85% of null (zero) counts

prop.null[order(prop.null)]
##  NC_325_C_Pa_yr2  NO_288_C_Pa_yr2  NC_294_C_Pa_yr2 NO_289_La_Pa_yr2 
##        0.2340478        0.2956393        0.3449125        0.3695492 
##  NC_300_C_Pa_yr2  NO_279_C_Pa_yr2  NO_261_C_Pa_yr2 NC_317_La_Pa_yr2 
##        0.4311407        0.8745997        0.8869180        1.0716925 
##  NC_316_C_Pa_yr2  NO_264_C_Pa_yr2  NC_297_C_Pa_yr2 NC_314_La_Pa_yr2 
##        1.3919685        1.4166051        1.6136980        1.8354274 
##  NO_276_C_Pa_yr2  NC_319_C_Pa_yr2  NC_313_C_Pa_yr2  NO_267_C_Pa_yr2 
##        2.1310668        2.1680217        2.4636610        2.8948017 
##  NO_282_C_Pa_yr2 NC_295_La_Pa_yr2  NC_322_C_Pa_yr2 NO_277_La_Pa_yr2 
##        3.6215817        3.9418576        3.9788125        4.6193644 
##  NC_291_C_Pa_yr2 NO_271_La_Pa_yr2 NC_326_La_Pa_yr2 NC_298_La_Pa_yr2 
##        4.8410939        5.2106430        5.5925105        5.8635132 
## NO_274_La_Pa_yr2  NO_270_C_Pa_yr2 NO_265_La_Pa_yr2 NO_268_La_Pa_yr2 
##       10.6306972       12.1088938       63.0697216       74.4395171 
## NO_280_La_Pa_yr2 NC_292_La_Pa_yr2 NC_320_La_Pa_yr2 NC_311_La_Pa_yr2 
##       84.0601133       87.6201035       88.7903424       93.9147573

Remove sample(s) with >85% of null/zero counts

drop <- c("NC_292_La_Pa_yr2", "NC_320_La_Pa_yr2", "NC_311_La_Pa_yr2")

host_means_filtered <- host_means_filtered[,!(names(host_means_filtered) %in% drop)]
meta <- meta[!(row.names(meta) %in% drop),]

prop.null <- apply(host_means_filtered, 2, function(x) 100*mean(x==0))
prop.null[order(prop.null)]
##  NC_325_C_Pa_yr2  NO_288_C_Pa_yr2  NC_294_C_Pa_yr2 NO_289_La_Pa_yr2 
##        0.2340478        0.2956393        0.3449125        0.3695492 
##  NC_300_C_Pa_yr2  NO_279_C_Pa_yr2  NO_261_C_Pa_yr2 NC_317_La_Pa_yr2 
##        0.4311407        0.8745997        0.8869180        1.0716925 
##  NC_316_C_Pa_yr2  NO_264_C_Pa_yr2  NC_297_C_Pa_yr2 NC_314_La_Pa_yr2 
##        1.3919685        1.4166051        1.6136980        1.8354274 
##  NO_276_C_Pa_yr2  NC_319_C_Pa_yr2  NC_313_C_Pa_yr2  NO_267_C_Pa_yr2 
##        2.1310668        2.1680217        2.4636610        2.8948017 
##  NO_282_C_Pa_yr2 NC_295_La_Pa_yr2  NC_322_C_Pa_yr2 NO_277_La_Pa_yr2 
##        3.6215817        3.9418576        3.9788125        4.6193644 
##  NC_291_C_Pa_yr2 NO_271_La_Pa_yr2 NC_326_La_Pa_yr2 NC_298_La_Pa_yr2 
##        4.8410939        5.2106430        5.5925105        5.8635132 
## NO_274_La_Pa_yr2  NO_270_C_Pa_yr2 NO_265_La_Pa_yr2 NO_268_La_Pa_yr2 
##       10.6306972       12.1088938       63.0697216       74.4395171 
## NO_280_La_Pa_yr2 
##       84.0601133
prop.null <- apply(host_means_filtered, 2, function(x) 100*mean(x==0))

barplot(prop.null, main="Percentage of null counts per sample", 
        horiz=TRUE, cex.names=0.5, las=1, 
        col=meta$group, ylab='Samples', xlab='% of null counts')

data summary

meta %>%
  group_by(group) %>%
  dplyr::summarise(count = n())
## # A tibble: 4 x 2
##   group          count
## * <fct>          <int>
## 1 Lagoon.Control     9
## 2 Lagoon.Ojo         5
## 3 Ojo.Control        8
## 4 Ojo.Ojo            7

RNA-seq count distribution

To determine the appropriate statistical model, we need information about the distribution of counts.

These images illustrate some common features of RNA-seq count data:

  • a low number of counts associated with a large proportion of genes, and
  • a long right tail due to the lack of any upper limit for expression.

# To get an idea about how RNA-seq counts are distributed, plot counts for a couple samples
p1 <- ggplot(host_means_filtered) +
  geom_histogram(aes(x = NC_291_C_Pa_yr2), stat = "bin", bins = 200) +
  xlim(-5, 500)  +
  xlab("Raw expression counts") +
  ylab("Number of genes")

p2 <- ggplot(host_means_filtered) +
  geom_histogram(aes(x = NO_289_La_Pa_yr2), stat = "bin", bins = 200) +
  xlim(-5, 500)  +
  xlab("Raw expression counts") +
  ylab("Number of genes")

grid.arrange(p1, p2, ncol = 2)
## Warning: Removed 30 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 38 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).

Modeling count data

Count data are often modeled using the binomial distribution, which gives the probability of, for example, obtaining a number of heads in a fixed number of coin tosses. However, not all count data fit the binomial distribution: the binomial applies when there is a fixed, known number of trials.

When the number of trials is very large (e.g. the number of people who buy lottery tickets) but the probability of the event is very small (the probability of winning), the Poisson distribution is used to model these types of count data. The Poisson is similar to the binomial, but models counts of rare events over a very large number of opportunities rather than a fixed, small number of trials. Details are provided by Rafael Irizarry in the EdX class.

With RNA-Seq data, a very large number of RNAs are represented and the probability of pulling out a particular transcript is very small. Thus, it would be an appropriate situation to use the Poisson distribution. However, a unique property of this distribution is that the mean == variance. Realistically, with RNA-Seq data there is always some biological variation present across the replicates (within a sample class). Genes with larger average expression levels will tend to have larger observed variances across replicates.

If the proportions of mRNA stayed exactly constant between the biological replicates for each sample class, we could expect Poisson distribution (where mean == variance). A nice description of this concept is presented by Rafael Irizarry in the EdX class. But this doesn’t happen in practice, and so the Poisson distribution is only considered appropriate for a single biological sample.

The model that fits best, given this type of variability between replicates, is the Negative Binomial (NB) model. Essentially, the NB model is a good approximation for data where the mean < variance, as is the case with RNA-Seq count data.
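This mean-variance relationship can be seen in simulation. The sketch below compares a Poisson sample with a negative binomial sample of the same mean (mu = 100 and size = 10 are arbitrary illustration values):

```r
set.seed(1)
pois <- rpois(1e5, lambda = 100)           # Poisson: variance ~= mean
nb   <- rnbinom(1e5, mu = 100, size = 10)  # NB: variance = mu + mu^2/size

c(mean = mean(pois), var = var(pois))      # both close to 100
c(mean = mean(nb),   var = var(nb))        # mean ~100, variance ~1100
```

For the NB sample, the theoretical variance is mu + mu^2/size = 100 + 10000/10 = 1100, an order of magnitude larger than the mean.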

plot the mean versus the variance of your data

Remember for the Poisson model, mean = variance, but for Negative Binomial, mean < variance

mean_counts <- apply(host_means_filtered, 1, mean) 
variance_counts <- apply(host_means_filtered, 1, var) 
df <- data.frame(mean_counts, variance_counts)

ggplot(df) +
        geom_point(aes(x=mean_counts, y=variance_counts)) + 
        geom_line(aes(x=mean_counts, y=mean_counts), color="red") +
        scale_y_log10() +
        scale_x_log10() +
        theme(legend.position = "none")

Note that in the above figure, the variance across replicates tends to be greater than the mean (red line), especially for genes with large mean expression levels. This is a good indication that our data do not fit the Poisson distribution and we need to account for this increase in variance using the Negative Binomial model (i.e. Poisson will underestimate variability leading to an increase in false positive DE genes).
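The overdispersion visible in the figure can also be quantified per gene with a simple method-of-moments estimate: under the NB model, variance = mean + alpha * mean^2, so alpha can be approximated as (variance - mean) / mean^2. A sketch using the mean_counts and variance_counts vectors computed above (this is a rough diagnostic, not DESeq2's dispersion estimator):

```r
# Method-of-moments gene-wise dispersion under variance = mean + alpha * mean^2
alpha_mm <- (variance_counts - mean_counts) / mean_counts^2

# Restrict to moderately expressed genes; positive values indicate overdispersion
summary(alpha_mm[mean_counts > 10])
```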

How does the dispersion relate to our model?

To accurately model sequencing counts, we need to generate accurate estimates of within-group variation (variation between replicates of the same sample group) for each gene. With only a few (3-6) replicates per group, the estimates of variation for each gene are often unreliable (due to the large differences in dispersion for genes with similar means).

To address this problem, DESeq2 shares information across genes to generate more accurate estimates of variation based on the mean expression level of the gene using a method called ‘shrinkage’. DESeq2 assumes that genes with similar expression levels have similar dispersion.

Estimating the dispersion for each gene separately: To model the dispersion based on expression level (mean counts of replicates), the dispersion for each gene is estimated using maximum likelihood estimation. In other words, given the count values of the replicates, the most likely estimate of dispersion is calculated.
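DESeq2 exposes the gene-wise, fitted-trend, and final (shrunken) dispersion estimates through plotDispEsts(). A sketch, to be run after the dds object is created and the full pipeline has been fitted (both shown later in this document):

```r
# Run the full DESeq2 pipeline (size factors, dispersion estimation, GLM fit),
# then plot dispersion estimates against mean of normalized counts
dds_fit <- DESeq(dds)
plotDispEsts(dds_fit)  # black: gene-wise MLEs; red: fitted trend; blue: final (shrunken) values
```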

Count normalization using DESeq2

Ensure the row names of the metadata dataframe are present and in the same order as the column names of the counts dataframe.

To create the object we will need the count matrix and the metadata table as input. We will also need to specify a design formula. The design formula specifies the column(s) in the metadata table and how they should be used in the analysis.

# Check that sample names match in both files
all(colnames(host_means_filtered) %in% rownames(meta))
## [1] TRUE
all(colnames(host_means_filtered) == rownames(meta))
## [1] TRUE
# If your data did not match, you could use the match() function to rearrange them to be matching.
levels(meta$group)
## [1] "Lagoon.Control" "Lagoon.Ojo"     "Ojo.Control"    "Ojo.Ojo"
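If the names had not matched, the match()-based reordering mentioned in the comment above can be done as follows (a sketch; unnecessary here, since the check already returned TRUE):

```r
# Reorder the metadata rows to follow the count-matrix columns
meta <- meta[match(colnames(host_means_filtered), rownames(meta)), ]
all(colnames(host_means_filtered) == rownames(meta))  # should now return TRUE
```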

Create DESeq2 Dataset object

# each DESeq2 results contrast compares 2 groups at a time;
# the design itself can include all 4 groups

dds <- DESeqDataSetFromMatrix(countData = host_means_filtered,
                              colData = meta,
                              design= ~group) 

Note that neither rlog transformation nor the VST are used by the differential expression estimation in DESeq, which always occurs on the raw count data, through generalized linear modeling which incorporates knowledge of the variance-mean dependence. The rlog transformation and VST are offered as separate functionality which can be used for visualization, clustering or other machine learning tasks.

Count data transformation

In order to test for differential expression, we operate on raw counts and use discrete distributions as described in the previous section on differential expression. However for other downstream analyses – e.g. for visualization or clustering – it is useful to work with transformed versions of the count data.

Both transformations produce transformed data on the log2 scale which has been normalized with respect to library size or other normalization factors.

The point of these two transformations, the VST and the rlog, is to remove the dependence of the variance on the mean, particularly the high variance of the logarithm of count data when the mean is low. In particular, genes with low expression level and therefore low read counts tend to have high variance, which is not removed efficiently by the ordinary logarithmic transformation.

Both VST and rlog use the experiment-wide trend of variance over mean, in order to transform the data to remove the experiment-wide trend. Note that we do not require or desire that all the genes have exactly the same variance after transformation. Indeed, in a figure below, you will see that after the transformations the genes with the same mean do not have exactly the same standard deviations, but that the experiment-wide trend has flattened. It is those genes with row variance above the trend which will allow us to cluster samples into interesting groups.

blind transformation to experimental design (TRUE / FALSE)

If many genes have large differences in counts due to the experimental design, it is important to set blind=FALSE for downstream analysis.

Blind dispersion estimation is not the appropriate choice if one expects that many or the majority of genes (rows) will have large differences in counts which are explainable by the experimental design, and one wishes to transform the data for downstream analysis. In this case, using blind dispersion estimation will lead to large estimates of dispersion, as it attributes differences due to experimental design as unwanted noise, and will result in overly shrinking the transformed values towards each other. By setting blind to FALSE, the dispersions already estimated will be used to perform transformations, or if not present, they will be estimated using the current design formula. Note that only the fitted dispersion estimates from mean-dispersion trend line are used in the transformation (the global dependence of dispersion on mean for the entire experiment). So setting blind to FALSE is still for the most part not using the information about which samples were in which experimental group in applying the transformation.

varianceStabilizingTransformation

This function calculates a variance stabilizing transformation (VST) from the fitted dispersion mean relation(s) and then transforms the count data (normalized by division by the size factors or normalization factors), yielding a matrix of values which are now approximately homoskedastic (having constant variance along the range of mean values). The transformation also normalizes with respect to library size.

The rlog is less sensitive to size factors, which can be an issue when size factors vary widely. These transformations are useful when checking for outliers or as input for machine learning techniques such as clustering or linear discriminant analysis.

The transformed data should be approximately variance stabilized and also include correction for size factors or normalization factors. The transformed data are on the log2 scale for large counts.

Limitations: In order to preserve normalization, the same transformation has to be used for all samples. This means the variance stabilization is only approximate. The more the size factors differ, the more residual dependence of the variance on the mean will remain in the transformed data. rlog is a transformation which can perform better in these cases. As shown in the vignette, the function meanSdPlot from the package vsn can be used to check whether this is a problem.

NB! vst() is a wrapper for varianceStabilizingTransformation(). vst() provides much faster estimation of the dispersion trend used to determine the formula for the VST. The speed-up is accomplished by subsetting to a smaller number of genes in order to estimate this dispersion trend. The subset of genes is chosen deterministically, to span the range of genes’ mean normalized count. This wrapper for the VST is not blind to the experimental design: the sample covariate information is used to estimate the global trend of genes’ dispersion values over the genes’ mean normalized count. It can be made strictly blind to the experimental design by first assigning a design of ~1 before running this function, or by avoiding subsetting and using varianceStabilizingTransformation().
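The "strictly blind" option described above can be sketched as follows, assigning a design of ~1 on a copy of the dataset before calling vst():

```r
# Make the fast vst() wrapper strictly blind to the experimental design
dds_blind <- dds
design(dds_blind) <- ~ 1      # drop all sample covariates from the design
vsd_strict_blind <- vst(dds_blind)
```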

fitType:

In case dispersions have not yet been estimated for object, this parameter is passed on to estimateDispersions().

If estimateDispersions was called with:

- fitType="parametric", a closed-form expression for the variance stabilizing transformation is used on the normalized count data. 
- fitType="local", the reciprocal of the square root of the variance of the normalized counts, as derived from the dispersion fit, is then numerically integrated, and the integral (approximated by a spline function) is evaluated for each count value in the column, yielding a transformed value.
- fitType="mean", a VST is applied for Negative Binomial distributed counts, ’k’, with a fixed dispersion, ’a’: ( 2 asinh(sqrt(a k)) - log(a) - log(4) )/log(2).

In all cases, the transformation is scaled such that for large counts it becomes asymptotically equal to the logarithm to base 2 of normalized counts.
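The fitType="mean" closed form above can be checked numerically: for large counts it converges to log2 of the count. A sketch (a = 0.1 is an arbitrary dispersion value chosen for illustration):

```r
# Closed-form VST for NB counts k with fixed dispersion a (fitType = "mean")
vst_mean <- function(k, a) (2 * asinh(sqrt(a * k)) - log(a) - log(4)) / log(2)

# For large counts the transform approaches log2(k):
vst_mean(1e6, a = 0.1)  # ~19.93, essentially log2(1e6)
```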

transformed counts - vst Blind = TRUE

vsdBlindTrue <- DESeq2::varianceStabilizingTransformation(dds, blind = TRUE, fitType = "parametric")

boxplot(assay(vsdBlindTrue), col=meta$group)

transformed counts - vst Blind = FALSE

vsdBlindFalse <- DESeq2::varianceStabilizingTransformation(dds, blind = FALSE, fitType = "parametric")

boxplot(assay(vsdBlindFalse), col=meta$group)

rlogged transformation

Useful for various unsupervised clustering analyses.

Warning message (but not an error): “rlog() may take a few minutes with 30 or more samples, vst() is a much faster transformation”

transformed counts - rlogged Blind = TRUE

rlogged.BlindTrue = rlogTransformation(dds, blind = TRUE)

boxplot(assay(rlogged.BlindTrue), col=meta$group)

transformed counts - rlogged Blind = FALSE

rlogged.BlindFalse = rlogTransformation(dds, blind = FALSE)

boxplot(assay(rlogged.BlindFalse), col=meta$group)

Effects of transformations on the variance

The figure below plots the standard deviation of the transformed data, across samples, against the mean, using the shifted logarithm transformation, the regularized log transformation and the variance stabilizing transformation. The shifted logarithm has elevated standard deviation in the lower count range, and the regularized log to a lesser extent, while for the variance stabilized data the standard deviation is roughly constant along the whole dynamic range.

Note that the vertical axis in such plots is the square root of the variance over all samples, so including the variance due to the experimental conditions. While a flat curve of the square root of variance over the mean may seem like the goal of such transformations, this may be unreasonable in the case of datasets with many true differences due to the experimental conditions.

Plot row standard deviations versus row means

The scatterplot of these versus each other allows you to visually verify whether there is a dependence of the standard deviation (or variance) on the mean. The red line depicts the running median estimator (window-width 10%). If there is no variance-mean dependence, then the line should be approximately horizontal.

# this gives log2(n + 1)
ntd <- normTransform(dds)
meanSdPlot(assay(ntd))

meanSdPlot(assay(vsdBlindTrue))

meanSdPlot(assay(vsdBlindFalse))

meanSdPlot(assay(rlogged.BlindTrue))

meanSdPlot(assay(rlogged.BlindFalse)) 

Cluster dendrogram - rlog Blind = FALSE

dists.rlog.BlindFalse <- dist(t(assay(rlogged.BlindFalse)))
plot(hclust(dists.rlog.BlindFalse))

Cluster dendrogram - vst Blind = FALSE

dists.vsdBlindFalse <- dist(t(assay(vsdBlindFalse)))
plot(hclust(dists.vsdBlindFalse))

save transformed count data

The input file has to contain all the genes, not just differentially expressed ones. Note that you can use the resulting transformed values only for visualization and clustering (not for differential expression analysis, which needs raw counts).

# save varianceStabilizingTransformation counts
saveRDS(vsdBlindFalse, file = "counts_vst.BlindFalse_Past_Host.rds")

# later, after you already created the file, can load the existing file:
#vsdBlindFalse <- readRDS("counts_vst.BlindFalse_Past_Host.rds")

Obtain normalized counts data

To perform the median-of-ratios method of normalization, DESeq2 has a single estimateSizeFactors() function that will generate size factors for us. We use the function explicitly in the example below, but in a typical RNA-seq analysis this step is performed automatically by the DESeq() function.

dds <- estimateSizeFactors(dds)
plot(sort(sizeFactors(dds)))
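The median-of-ratios calculation can be reproduced by hand as a sanity check. This sketch mirrors the default method: a per-gene geometric mean is taken across samples, and each sample's size factor is the median ratio of its counts to those geometric means (genes with a zero anywhere drop out of the geometric means):

```r
# Per-gene geometric mean across samples, on the log scale
log_geo_means <- rowMeans(log(as.matrix(host_means_filtered)))

# Per sample: median ratio of counts to geometric means, restricted to
# genes with a finite geometric mean and a non-zero count in that sample
manual_sf <- apply(as.matrix(host_means_filtered), 2, function(cnts)
  exp(median((log(cnts) - log_geo_means)[is.finite(log_geo_means) & cnts > 0])))

# These should closely match sizeFactors(dds)
head(manual_sf)
```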

Normalization factor applied to each sample

#sort.list(sizeFactors(dds))
sizeFactors(dds)
##  NC_291_C_Pa_yr2  NC_294_C_Pa_yr2  NC_297_C_Pa_yr2  NC_300_C_Pa_yr2 
##       0.97422808       2.04403498       1.39252450       2.40697331 
##  NC_313_C_Pa_yr2  NC_316_C_Pa_yr2  NC_319_C_Pa_yr2  NC_322_C_Pa_yr2 
##       1.08638450       1.41160343       1.11012666       0.88245888 
##  NC_325_C_Pa_yr2 NO_265_La_Pa_yr2 NO_268_La_Pa_yr2 NO_271_La_Pa_yr2 
##       3.30476752       0.10260873       0.16937402       0.90067834 
## NO_274_La_Pa_yr2 NO_277_La_Pa_yr2 NO_280_La_Pa_yr2 NO_289_La_Pa_yr2 
##       0.72317112       0.99517320       0.07844891       2.13138789 
## NC_295_La_Pa_yr2 NC_298_La_Pa_yr2 NC_314_La_Pa_yr2 NC_317_La_Pa_yr2 
##       1.02617456       0.94739088       1.39114064       1.56042034 
## NC_326_La_Pa_yr2  NO_261_C_Pa_yr2  NO_264_C_Pa_yr2  NO_267_C_Pa_yr2 
##       1.04016264       1.73844008       2.05215854       1.60756762 
##  NO_270_C_Pa_yr2  NO_276_C_Pa_yr2  NO_279_C_Pa_yr2  NO_282_C_Pa_yr2 
##       0.55616714       1.23600502       1.92403145       0.88463113 
##  NO_288_C_Pa_yr2 
##       1.95277415
order(sizeFactors(dds), decreasing = TRUE)
##  [1]  9  4 16 23  2 29 27 22 24 20  6  3 19 26  7  5 21 17 14  1 18 12 28  8 13
## [26] 25 11 10 15

NOTE: DESeq2 doesn’t actually use normalized counts; rather, it uses the raw counts and models the normalization inside the Generalized Linear Model (GLM).

These normalized counts are useful for downstream visualization of results, but cannot be used as input to DESeq2 or to other differential expression tools based on the negative binomial model, which require raw counts.

Normalized counts

# to retrieve the normalized counts matrix from dds
normalized_counts <- counts(dds, normalized=TRUE)

rownames(normalized_counts) <- rownames(host_means_filtered)

# Save normalized counts table as R data file for later use
saveRDS(normalized_counts, file = "counts_normalized_Past_Host.rds")

normalized_counts <- as.data.frame.array(normalized_counts)
setDT(normalized_counts, keep.rownames = "gene")

# Save normalized data matrix for later use:
write.csv(normalized_counts, "counts_normalized_Past_Host.csv", row.names = F)

Total number of raw counts per sample

colSums(counts(dds))[order(colSums(counts(dds)))]
## NO_265_La_Pa_yr2 NO_280_La_Pa_yr2  NO_270_C_Pa_yr2  NO_282_C_Pa_yr2 
##           622918          1715141          2779196          3204137 
##  NO_267_C_Pa_yr2  NO_276_C_Pa_yr2  NC_291_C_Pa_yr2  NC_322_C_Pa_yr2 
##          3350833          3453054          3596661          3871069 
##  NO_261_C_Pa_yr2 NO_274_La_Pa_yr2  NC_313_C_Pa_yr2 NC_326_La_Pa_yr2 
##          4210889          4528293          4805699          4853896 
## NC_298_La_Pa_yr2  NO_264_C_Pa_yr2  NC_294_C_Pa_yr2 NO_277_La_Pa_yr2 
##          5264769          6144922          6218322          6314252 
##  NC_316_C_Pa_yr2 NO_271_La_Pa_yr2  NC_297_C_Pa_yr2 NC_295_La_Pa_yr2 
##          6646842          6654163          6747210          7218588 
## NO_289_La_Pa_yr2 NO_268_La_Pa_yr2  NO_288_C_Pa_yr2  NC_319_C_Pa_yr2 
##          7333924          7624820          8380415          8495517 
## NC_314_La_Pa_yr2  NO_279_C_Pa_yr2 NC_317_La_Pa_yr2  NC_300_C_Pa_yr2 
##          8769233          8858276          9224904         12797270 
##  NC_325_C_Pa_yr2 
##         15330133

Barplot raw counts

colSums(counts(dds)) %>% barplot(col=meta$group)

Total number of normalized counts per sample

# How do the numbers correlate with the size factor?
# Now take a look at the total depth after normalization using:
#colSums(counts(dds, normalized=T))

colSums(counts(dds, normalized=T))[order(colSums(counts(dds, normalized=T)))]
##  NO_267_C_Pa_yr2  NO_261_C_Pa_yr2  NO_276_C_Pa_yr2  NO_264_C_Pa_yr2 
##          2084412          2422223          2793722          2994370 
##  NC_294_C_Pa_yr2 NO_289_La_Pa_yr2  NO_282_C_Pa_yr2  NC_291_C_Pa_yr2 
##          3042180          3440915          3622003          3691806 
##  NO_288_C_Pa_yr2  NC_322_C_Pa_yr2  NC_313_C_Pa_yr2  NO_279_C_Pa_yr2 
##          4291543          4386685          4423571          4604018 
##  NC_325_C_Pa_yr2 NC_326_La_Pa_yr2  NC_316_C_Pa_yr2  NC_297_C_Pa_yr2 
##          4638793          4666478          4708718          4845308 
##  NO_270_C_Pa_yr2  NC_300_C_Pa_yr2 NC_298_La_Pa_yr2 NC_317_La_Pa_yr2 
##          4997052          5316748          5557124          5911807 
## NO_265_La_Pa_yr2 NO_274_La_Pa_yr2 NC_314_La_Pa_yr2 NO_277_La_Pa_yr2 
##          6070809          6261717          6303628          6344877 
## NC_295_La_Pa_yr2 NO_271_La_Pa_yr2  NC_319_C_Pa_yr2 NO_280_La_Pa_yr2 
##          7034464          7387946          7652746         21863160 
## NO_268_La_Pa_yr2 
##         45017647

Barplot Normalized counts

colSums(counts(dds, normalized = T)) %>% barplot(col=meta$group)

Quality Control

A useful initial step in an RNA-seq analysis is often to assess overall similarity between samples:

To explore the similarity of our samples, we perform sample-level QC using Principal Component Analysis (PCA) and hierarchical clustering. Sample-level QC shows how well our replicates cluster together and whether our experimental condition represents the major source of variation in the data. It can also flag sample outliers, which may need to be explored further to determine whether they should be removed prior to DE analysis.

When using these unsupervised clustering methods, log2-transformation of the normalized counts improves the distances/clustering for visualization. DESeq2 uses a regularized log transform (rlog) of the normalized counts for sample-level QC as it moderates the variance across the mean, improving the clustering.

Here I use varianceStabilizingTransformation() transformed counts

Principal Component Analysis (PCA)

Exploratory

By default, the function uses the top 500 most variable genes. You can change this by adding the ntop argument and specifying how many genes you want to use to draw the plot.
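Roughly what plotPCA() does internally: it keeps the ntop rows with the highest variance and runs prcomp() on the transposed matrix. A base-R sketch of that selection, using a random matrix in place of assay(vsdBlindFalse):

```r
# Toy stand-in for assay(vsdBlindFalse): 100 genes x 6 samples (random data)
set.seed(1)
mat <- matrix(rnorm(600), nrow = 100,
              dimnames = list(paste0("gene", 1:100), paste0("s", 1:6)))

ntop <- 25  # plotPCA() default is 500
rv  <- apply(mat, 1, var)                           # per-gene variance
top <- order(rv, decreasing = TRUE)[seq_len(ntop)]  # most-variable genes
pca <- prcomp(t(mat[top, ]), center = TRUE, scale. = FALSE)

# proportion of variance per PC, as reported in the axis labels
percent_var <- round(100 * pca$sdev^2 / sum(pca$sdev^2))
```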

PCA - Group

pcadata = DESeq2::plotPCA(vsdBlindFalse, intgroup = c( "group"), returnData = TRUE)
percentVar = round(100 * attr(pcadata, "percentVar"))
pca = prcomp(t(assay(vsdBlindFalse)), center = TRUE, scale. = FALSE)

DESeq2::plotPCA(vsdBlindFalse, returnData = TRUE, intgroup = c("group")) %>% 
      ggplot(aes(x = PC1, y = PC2)) +
      geom_point(aes(colour = group), size = 2) +
      stat_ellipse(geom = "polygon", alpha = 1/10, aes(fill = group)) +
      xlab(paste0("PC1: ",percentVar[1],"% variance")) +
      ylab(paste0("PC2: ",percentVar[2],"% variance")) +
      theme_cowplot()
## Warning in MASS::cov.trob(data[, vars]): Probable convergence failure

PCA - Origin

pcadata = DESeq2::plotPCA(vsdBlindFalse, intgroup = c("Origin_name"), returnData = TRUE)
percentVar = round(100 * attr(pcadata, "percentVar"))
pca = prcomp(t(assay(vsdBlindFalse)), center = TRUE, scale. = FALSE)

DESeq2::plotPCA(vsdBlindFalse, returnData = TRUE, intgroup = c("Origin_name") ) %>% 
      ggplot(aes(x = PC1, y = PC2)) +
      geom_point(aes(colour = Origin_name), size = 2) +
      stat_ellipse(geom = "polygon", alpha = 1/10, aes(fill = Origin_name)) +
      xlab(paste0("PC1: ",percentVar[1],"% variance")) +
      ylab(paste0("PC2: ",percentVar[2],"% variance")) +
      theme_cowplot()

PCA - Destination

pcadata = DESeq2::plotPCA(vsdBlindFalse, intgroup = c("Destination_name"), returnData = TRUE)
percentVar = round(100 * attr(pcadata, "percentVar"))
pca = prcomp(t(assay(vsdBlindFalse)), center = TRUE, scale. = FALSE)

DESeq2::plotPCA(vsdBlindFalse, returnData = TRUE, intgroup = c("Destination_name") ) %>% 
      ggplot(aes(x = PC1, y = PC2)) +
      geom_point(aes(colour = Destination_name), size = 2) +
      stat_ellipse(geom = "polygon", alpha = 1/10, aes(fill = Destination_name)) +
      xlab(paste0("PC1: ",percentVar[1],"% variance")) +
      ylab(paste0("PC2: ",percentVar[2],"% variance")) +
      theme_cowplot()

PCA - Colony

pcadata = DESeq2::plotPCA(vsdBlindFalse, intgroup = c("Colony_ID"), returnData = TRUE)
percentVar = round(100 * attr(pcadata, "percentVar"))
pca = prcomp(t(assay(vsdBlindFalse)), center = TRUE, scale. = FALSE)

DESeq2::plotPCA(vsdBlindFalse, returnData = TRUE, intgroup = c("Colony_ID")) %>% 
      ggplot(aes(x = PC1, y = PC2)) +
      geom_point(aes(colour = Colony_ID), size = 2) +
      xlab(paste0("PC1: ",percentVar[1],"% variance")) +
      ylab(paste0("PC2: ",percentVar[2],"% variance")) +
      theme_cowplot()

Hierarchical Clustering

Since there is no built-in function for heatmaps in DESeq2, we will be using the pheatmap() function from the pheatmap package. This function requires a matrix/data frame of numeric values as input, so the first thing we need to do is retrieve that information from the vst-transformed object:

Extract the transformed counts matrix from the object

rld_mat <- assay(vsdBlindFalse) 

Compute pairwise correlation values

rld_cor <- cor(rld_mat)    
#head(rld_cor)   ## check the output of cor(), make note of the rownames and colnames

Plot the correlation values as a heatmap:

pheatmap(rld_cor)

Correlation of gene expression for all pairwise combinations of samples

Hierarchical Clustering Heatmap

This figure clusters samples based on Euclidean distances computed from overall expression.
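As a minimal base-R illustration of the distance computation (a toy matrix stands in for assay(vsdBlindFalse)): dist() operates on rows, so the matrix is transposed to obtain sample-to-sample Euclidean distances.

```r
# Toy expression matrix: 3 genes x 3 samples (values invented)
mat <- matrix(c(1, 2, 3,
                1, 2, 3,
                4, 6, 8),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("gene", 1:3), c("s1", "s2", "s3")))

# transpose so that samples become rows, then compute pairwise distances
sample_dists_toy <- as.matrix(dist(t(mat)))
```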

# sample by distance heatmap
sampleDists <- as.matrix(dist(t(assay(vsdBlindFalse))))
heatmap.2(as.matrix(sampleDists), key=F, trace="none",
          col=colorpanel(100, "black", "white"),
          margin=c(10, 10))

Gene-level QC

In addition to examining how well the samples/replicates cluster together, there are a few more QC steps. Prior to differential expression analysis it is beneficial to omit genes that have little or no chance of being detected as differentially expressed. This will increase the power to detect differentially expressed genes. The genes omitted fall into three categories:

- Genes with zero counts in all samples
- Genes with an extreme count outlier
- Genes with low mean normalized counts (independent filtering)

DESeq2 performs this filtering by default; however, other DE tools, such as edgeR, do not. Filtering is a necessary step even if you are using limma-voom and/or edgeR’s quasi-likelihood methods, so be sure to follow the pre-filtering steps outlined in those tools’ user guides on Bioconductor, as they generally perform much better with filtered data.
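For tools that do not filter automatically, a minimal manual pre-filter on a raw counts matrix looks like the sketch below. The threshold of 10 total reads is a common rule of thumb, not a DESeq2 requirement, and the matrix is invented:

```r
# Toy counts matrix: 4 genes x 3 samples; gene3 is all-zero, gene4 is very low
counts_mat <- matrix(c(  5,  8,  12,
                       100, 90, 110,
                         0,  0,   0,
                         1,  0,   2),
                     nrow = 4, byrow = TRUE,
                     dimnames = list(paste0("gene", 1:4), paste0("s", 1:3)))

keep     <- rowSums(counts_mat) >= 10            # at least 10 reads in total
filtered <- counts_mat[keep, , drop = FALSE]     # drops gene3 and gene4
```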

Differential expression analysis

based on the Negative Binomial (a.k.a. Gamma-Poisson) distribution

The DESeq() function performs a default analysis through the following steps:

- estimation of size factors: estimateSizeFactors
- estimation of dispersion: estimateDispersions
- Negative Binomial GLM fitting and Wald statistics: nbinomWaldTest

In this step we essentially want to determine whether the mean expression levels of different sample groups are significantly different.

Wald test: the shrunken estimate of LFC is divided by its standard error, resulting in a z-statistic, which is compared to a standard normal distribution.
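The arithmetic of the Wald test can be sketched in base R. The LFC and SE values below are hypothetical, but the stat and pvalue columns of a results() table are computed in exactly this way:

```r
lfc   <- 1.5   # hypothetical log2 fold change estimate
lfcSE <- 0.5   # hypothetical standard error of that estimate

z      <- lfc / lfcSE              # Wald z-statistic ("stat" column)
pvalue <- 2 * pnorm(-abs(z))       # two-sided p-value against N(0, 1)
```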

In DESeq2, we assume that genes of similar average expression strength have similar dispersion. https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8

  • We first treat each gene separately and estimate gene-wise dispersion estimates (using maximum likelihood), which rely only on the data of each individual gene (black dots in Figure 1).
  • Next, we determine the location parameter of the distribution of these estimates; to allow for dependence on average expression strength, we fit a smooth curve, as shown by the red line in Figure 1.
  • This provides an accurate estimate for the expected dispersion value for genes of a given expression strength but does not represent deviations of individual genes from this overall trend.
  • We then shrink the gene-wise dispersion estimates toward the values predicted by the curve to obtain final dispersion values (blue arrow heads).
  • We use an empirical Bayes approach, which lets the strength of shrinkage depend (i) on an estimate of how close true dispersion values tend to be to the fit and (ii) on the degrees of freedom: as the sample size increases, the shrinkage decreases in strength, and eventually becomes negligible.
  • Our approach therefore accounts for gene-specific variation to the extent that the data provide this information, while the fitted curve aids estimation and testing in less information-rich settings.

fitType = parametric

  • a closed-form expression for the variance stabilizing transformation (vst) is used on the normalized count data.
  • a dispersion-mean relation of the form

    dispersion = asymptDisp + extraPois / mean

    is fit via a robust gamma-family GLM.
  • the coefficients asymptDisp and extraPois are given in the coefficients attribute of the dispersionFunction of the object.
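The fitted curve is easy to evaluate by hand. Using the asymptDisp and extraPois coefficients reported by dispersionFunction(dds) in this document (0.1672212 and 0.4544850), the expected dispersion at a given mean normalized count is:

```r
# Parametric dispersion-mean fit: dispersion = asymptDisp + extraPois / mean
# Coefficients taken from the dispersionFunction(dds) output in this document.
asymptDisp <- 0.1672212
extraPois  <- 0.4544850

disp_fit <- function(q) asymptDisp + extraPois / q

disp_fit(10)    # expected dispersion for a gene with mean normalized count 10
disp_fit(1000)  # approaches asymptDisp as the mean grows
```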

# Generate normalized counts
# perform the median of ratios method of normalization
# perform a Wald test in DESeq2 pairwise comparison between treatment effects

dds <- DESeq2::DESeq(dds, fitType = "parametric", test = "Wald")
## using pre-existing size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## -- replacing outliers and refitting for 19 genes
## -- DESeq argument 'minReplicatesForReplace' = 7 
## -- original counts are preserved in counts(dds)
## estimating dispersions
## fitting model and testing
results.dds = results(dds)
head(results.dds)
## log2 fold change (MLE): group Ojo.Ojo vs Lagoon.Control 
## Wald test p-value: group Ojo.Ojo vs Lagoon.Control 
## DataFrame with 6 rows and 6 columns
##                      baseMean log2FoldChange     lfcSE       stat    pvalue
##                     <numeric>      <numeric> <numeric>  <numeric> <numeric>
## GCKDGN101A08CL_Past   3.25973     -0.8353504  0.550043 -1.5186996 0.1288381
## GCKDGN101A0M5U_Past   8.41096     -0.9193626  0.523007 -1.7578391 0.0787749
## GCKDGN101A2BO2_Past   3.73717      0.0103628  0.529103  0.0195856 0.9843740
## GCKDGN101A2NQR_Past   4.55599     -0.5479719  0.487032 -1.1251244 0.2605363
## GCKDGN101A2V3C_Past   4.47120      0.4775614  0.802212  0.5953060 0.5516390
## GCKDGN101A32G8_Past   3.39789     -0.2879045  0.754891 -0.3813856 0.7029171
##                          padj
##                     <numeric>
## GCKDGN101A08CL_Past        NA
## GCKDGN101A0M5U_Past        NA
## GCKDGN101A2BO2_Past        NA
## GCKDGN101A2NQR_Past        NA
## GCKDGN101A2V3C_Past        NA
## GCKDGN101A32G8_Past        NA
dispersionFunction(dds)
## function (q) 
## coefs[1] + coefs[2]/q
## <bytecode: 0x7fc3196c1b68>
## <environment: 0x7fc31699b658>
## attr(,"coefficients")
## asymptDisp  extraPois 
##  0.1672212  0.4544850 
## attr(,"fitType")
## [1] "parametric"
## attr(,"varLogDispEsts")
## [1] 0.5768994
## attr(,"dispPriorVar")
## [1] 0.4936142

Fit curve to gene-wise dispersion estimates

This curve is displayed as a red line in the figure below, which plots the estimate for the expected dispersion value for genes of a given expression strength. Each black dot is a gene with an associated mean expression level and maximum likelihood estimate (MLE) of the dispersion.

The curve allows for more accurate identification of differentially expressed genes when sample sizes are small, and the strength of the shrinkage for each gene depends on:

- how close gene dispersions are to the curve
- sample size (more samples = less shrinkage)

This shrinkage method is particularly important for reducing false positives in the differential expression analysis. Genes with low dispersion estimates are shrunken towards the curve, and the resulting more accurate, higher shrunken values are used for model fitting and differential expression testing.

Dispersion estimates that are slightly above the curve are also shrunk toward the curve for better dispersion estimation; however, genes with extremely high dispersion values are not. This is due to the likelihood that the gene does not follow the modeling assumptions and has higher variability than others for biological or technical reasons [1]. Shrinking the values toward the curve could result in false positives, so these values are not shrunken. These genes are shown surrounded by blue circles below.

This is a good plot to examine to ensure your data is a good fit for the DESeq2 model. You expect your data to generally scatter around the curve, with the dispersion decreasing with increasing mean expression levels. If you see a cloud or different shapes, then you might want to explore your data more to see if you have contamination (mitochondrial, etc.) or outlier samples. Note how much shrinkage you get across the whole range of means in the plotDispEsts() plot for any experiment with low degrees of freedom.

Plot dispersion estimates - fitType = “parametric”

NB! “parametric” is the default; if the fit fails, it is automatically replaced with another fitType (when this happens, a warning message appears after running the DESeq() function above).

# already accomplished as part of DESeq() function (above):
#dds <- DESeq2::estimateSizeFactors(dds)
#dds <- DESeq2::estimateDispersionsGeneEst(dds)
#dds <- DESeq2::estimateDispersionsFit(dds)
#dds <- DESeq2::estimateDispersionsMAP(dds)

DESeq2::plotDispEsts(dds)

head(dispersions(dds))  # parametric
## [1] 0.1537171 0.2815734 0.1871478 0.1570423 0.9065312 0.6900497

All of the intermediate values (gene-wise dispersion estimates, fitted dispersion estimates from the trended fit, etc.) are stored in mcols(dds), with information about these columns in mcols(mcols(dds)).

rownames(mcols(mcols(dds)))
##  [1] "baseMean"                                         
##  [2] "baseVar"                                          
##  [3] "allZero"                                          
##  [4] "dispGeneEst"                                      
##  [5] "dispGeneIter"                                     
##  [6] "dispFit"                                          
##  [7] "dispersion"                                       
##  [8] "dispIter"                                         
##  [9] "dispOutlier"                                      
## [10] "dispMAP"                                          
## [11] "Intercept"                                        
## [12] "group_Lagoon.Ojo_vs_Lagoon.Control"               
## [13] "group_Ojo.Control_vs_Lagoon.Control"              
## [14] "group_Ojo.Ojo_vs_Lagoon.Control"                  
## [15] "SE_Intercept"                                     
## [16] "SE_group_Lagoon.Ojo_vs_Lagoon.Control"            
## [17] "SE_group_Ojo.Control_vs_Lagoon.Control"           
## [18] "SE_group_Ojo.Ojo_vs_Lagoon.Control"               
## [19] "WaldStatistic_Intercept"                          
## [20] "WaldStatistic_group_Lagoon.Ojo_vs_Lagoon.Control" 
## [21] "WaldStatistic_group_Ojo.Control_vs_Lagoon.Control"
## [22] "WaldStatistic_group_Ojo.Ojo_vs_Lagoon.Control"    
## [23] "WaldPvalue_Intercept"                             
## [24] "WaldPvalue_group_Lagoon.Ojo_vs_Lagoon.Control"    
## [25] "WaldPvalue_group_Ojo.Control_vs_Lagoon.Control"   
## [26] "WaldPvalue_group_Ojo.Ojo_vs_Lagoon.Control"       
## [27] "betaConv"                                         
## [28] "betaIter"                                         
## [29] "deviance"                                         
## [30] "maxCooks"                                         
## [31] "replace"

Distribution of dispersion estimates

dispersions(dds) %>% hist(breaks = 500)

Differential expression analysis with DESeq2: model fitting and hypothesis testing

Evaluate contrasts

# It can be useful to include the gene names in the data set object:
rownames(dds) <- rownames(host_means_filtered)
levels(dds$group)
## [1] "Lagoon.Control" "Lagoon.Ojo"     "Ojo.Control"    "Ojo.Ojo"

Note that the ‘results’ function automatically performs independent filtering based on the mean of normalized counts for each gene, optimizing the number of genes which will have an adjusted p value below a given FDR cutoff, alpha.
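The benefit of independent filtering comes from the Benjamini-Hochberg correction scaling with the number of tests; removing genes that could never reach significance leaves fewer tests and hence smaller adjusted p-values. A base-R sketch with invented p-values:

```r
# Five p-values from well-expressed genes plus five from genes with counts
# too low to ever be significant (all values invented for illustration)
p_expressed <- c(0.0001, 0.004, 0.03, 0.2, 0.6)
p_lowcount  <- c(0.5, 0.7, 0.8, 0.9, 0.95)

# BH adjustment over all 10 tests vs only the 5 that survive filtering
padj_all      <- p.adjust(c(p_expressed, p_lowcount), method = "BH")
padj_filtered <- p.adjust(p_expressed, method = "BH")

# The surviving genes get smaller (or equal) adjusted p-values after filtering
```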

Pairwise contrasts

In each pairwise contrast, we put the treatment condition first and the control condition second, so the log2 fold change (MLE) output is treatment relative to control.

Groups: “Lagoon.Control” “Lagoon.Ojo” “Ojo.Control” “Ojo.Ojo”

Contrasts:

LO_LC “Lagoon.Ojo”, “Lagoon.Control”

OO_LO “Ojo.Ojo”, “Lagoon.Ojo”

OO_OC “Ojo.Ojo”, “Ojo.Control”

OC_LC “Ojo.Control”, “Lagoon.Control”

OO_LC “Ojo.Ojo”, “Lagoon.Control” (low pH vs. ambient pH, but not comparative sites)

The contrast “Lagoon.Ojo”, “Ojo.Control” does not make sense.

“Lagoon.Ojo”, “Lagoon.Control”

p-adj < 0.1

results_LO_LC_.1 <- results(dds, contrast = c("group", "Lagoon.Ojo", "Lagoon.Control"), alpha = 0.1)
head(results_LO_LC_.1)
## log2 fold change (MLE): group Lagoon.Ojo vs Lagoon.Control 
## Wald test p-value: group Lagoon.Ojo vs Lagoon.Control 
## DataFrame with 6 rows and 6 columns
##                      baseMean log2FoldChange     lfcSE      stat    pvalue
##                     <numeric>      <numeric> <numeric> <numeric> <numeric>
## GCKDGN101A08CL_Past   3.25973      -0.100546  0.473566 -0.212317  0.831860
## GCKDGN101A0M5U_Past   8.41096      -0.319544  0.487517 -0.655451  0.512177
## GCKDGN101A2BO2_Past   3.73717       0.362214  0.488342  0.741723  0.458255
## GCKDGN101A2NQR_Past   4.55599      -0.767995  0.480757 -1.597472  0.110161
## GCKDGN101A2V3C_Past   4.47120      -0.287735  0.841301 -0.342012  0.732342
## GCKDGN101A32G8_Past   3.39789      -0.724619  0.774599 -0.935476  0.349543
##                          padj
##                     <numeric>
## GCKDGN101A08CL_Past  0.961119
## GCKDGN101A0M5U_Past  0.864419
## GCKDGN101A2BO2_Past  0.838703
## GCKDGN101A2NQR_Past  0.587789
## GCKDGN101A2V3C_Past  0.934422
## GCKDGN101A32G8_Past  0.790714
summary(results_LO_LC_.1)
## 
## out of 8118 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 26, 0.32%
## LFC < 0 (down)     : 11, 0.14%
## outliers [1]       : 3, 0.037%
## low counts [2]     : 1572, 19%
## (mean count < 3)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
# convert to dataframe (from DESeq2 object)
res_LO_LC_.1_df <- data.frame(results_LO_LC_.1)

DE_LO_LC_padj.1 <- res_LO_LC_.1_df %>% 
  rownames_to_column('gene') %>% 
  as_tibble() %>%
  filter(padj < 0.1) %>% 
    arrange(padj)

head(DE_LO_LC_padj.1)
## # A tibble: 6 x 7
##   gene               baseMean log2FoldChange lfcSE  stat      pvalue     padj
##   <chr>                 <dbl>          <dbl> <dbl> <dbl>       <dbl>    <dbl>
## 1 isogroup07295_Past    10.4           3.97  0.751  5.29 0.000000125 0.000815
## 2 isogroup04919_Past     6.44          2.26  0.458  4.94 0.000000798 0.00261 
## 3 isogroup06592_Past     5.51         -3.35  0.756 -4.43 0.00000931  0.0203  
## 4 isogroup01504_Past    31.7           1.61  0.381  4.22 0.0000241   0.0394  
## 5 isogroup25132_Past     4.26          2.73  0.664  4.11 0.0000392   0.0513  
## 6 GE907925_Past         19.7          -0.974 0.254 -3.84 0.000124    0.0640
# save as RDS and CSV files
saveRDS(DE_LO_LC_padj.1, file = "Past_Host_DE_genes_LO_LC_padj.1.rds")
write.csv(DE_LO_LC_padj.1, "Past_Host_DE_genes_LO_LC_padj.1.csv")

Pairwise contrast: up- vs. down-regulated genes

up.results_LO_LC_padj.1 = row.names(results_LO_LC_.1[results_LO_LC_.1$padj<0.1 & !(is.na(results_LO_LC_.1$padj)) & results_LO_LC_.1$log2FoldChange>0,])

up.results_LO_LC_padj.1 <- as.data.frame(up.results_LO_LC_padj.1)
up.results_LO_LC_padj.1$DEG <- 'Up'

down.results_LO_LC_padj.1 = row.names(results_LO_LC_.1[results_LO_LC_.1$padj<0.1 & !(is.na(results_LO_LC_.1$padj)) & results_LO_LC_.1$log2FoldChange<0,])

down.results_LO_LC_padj.1 <- as.data.frame(down.results_LO_LC_padj.1)
down.results_LO_LC_padj.1$DEG <- 'Down'

LO_LC_padj.1_summary <- merge(up.results_LO_LC_padj.1, down.results_LO_LC_padj.1, all=T)
dim(LO_LC_padj.1_summary)
## [1] 37  3
# save file
write.csv(LO_LC_padj.1_summary, file="Past_Host_DE_genes_LO_LC_padj.1_summary.csv")

arrange by lowest p-values

This is later used as input for PERMANOVA

DE_LO_LC_pvalue <- res_LO_LC_.1_df %>%
  rownames_to_column('gene') %>% 
  as_tibble() %>%
  filter(pvalue < 0.05) %>% 
    arrange(pvalue)

head(DE_LO_LC_pvalue)
## # A tibble: 6 x 7
##   gene               baseMean log2FoldChange lfcSE  stat      pvalue     padj
##   <chr>                 <dbl>          <dbl> <dbl> <dbl>       <dbl>    <dbl>
## 1 isogroup07295_Past    10.4            3.97 0.751  5.29 0.000000125 0.000815
## 2 isogroup04919_Past     6.44           2.26 0.458  4.94 0.000000798 0.00261 
## 3 isogroup06592_Past     5.51          -3.35 0.756 -4.43 0.00000931  0.0203  
## 4 isogroup01504_Past    31.7            1.61 0.381  4.22 0.0000241   0.0394  
## 5 isogroup25132_Past     4.26           2.73 0.664  4.11 0.0000392   0.0513  
## 6 isogroup01499_Past    16.3            1.28 0.328  3.91 0.0000910   0.0640
# save as RDS and CSV files
saveRDS(DE_LO_LC_pvalue, file = "Past_Host_LO_LC_pval.05.rds")
write.csv(DE_LO_LC_pvalue, "Past_Host_LO_LC_pval.05.csv")
dim(DE_LO_LC_pvalue)
## [1] 762   7

“Ojo.Ojo”, “Lagoon.Ojo”

p-adj < 0.1

results_OO_LO_.1 <- results(dds, contrast = c("group", "Ojo.Ojo", "Lagoon.Ojo"), alpha = 0.1)
head(results_OO_LO_.1)
## log2 fold change (MLE): group Ojo.Ojo vs Lagoon.Ojo 
## Wald test p-value: group Ojo.Ojo vs Lagoon.Ojo 
## DataFrame with 6 rows and 6 columns
##                      baseMean log2FoldChange     lfcSE      stat    pvalue
##                     <numeric>      <numeric> <numeric> <numeric> <numeric>
## GCKDGN101A08CL_Past   3.25973      -0.734804  0.618690 -1.187678  0.234960
## GCKDGN101A0M5U_Past   8.41096      -0.599819  0.592547 -1.012273  0.311407
## GCKDGN101A2BO2_Past   3.73717      -0.351852  0.588924 -0.597449  0.550208
## GCKDGN101A2NQR_Past   4.55599       0.220023  0.584474  0.376447  0.706585
## GCKDGN101A2V3C_Past   4.47120       0.765296  0.930365  0.822576  0.410749
## GCKDGN101A32G8_Past   3.39789       0.436715  0.885266  0.493315  0.621790
##                          padj
##                     <numeric>
## GCKDGN101A08CL_Past        NA
## GCKDGN101A0M5U_Past  0.996627
## GCKDGN101A2BO2_Past        NA
## GCKDGN101A2NQR_Past        NA
## GCKDGN101A2V3C_Past        NA
## GCKDGN101A32G8_Past        NA
summary(results_OO_LO_.1)
## 
## out of 8118 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 31, 0.38%
## LFC < 0 (down)     : 2, 0.025%
## outliers [1]       : 3, 0.037%
## low counts [2]     : 4561, 56%
## (mean count < 6)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
# convert to dataframe (from DESeq2 object)
res_OO_LO_.1_df <- data.frame(results_OO_LO_.1)

DE_OO_LO_padj.1 <- res_OO_LO_.1_df %>% 
  rownames_to_column('gene') %>% 
  as_tibble() %>%
  filter(padj < 0.1) %>% 
    arrange(padj)

head(DE_OO_LO_padj.1)
## # A tibble: 6 x 7
##   gene               baseMean log2FoldChange lfcSE  stat   pvalue     padj
##   <chr>                 <dbl>          <dbl> <dbl> <dbl>    <dbl>    <dbl>
## 1 isogroup20691_Past     492.           8.61 1.17   7.35 1.98e-13 5.10e-10
## 2 isogroup23261_Past    4472.           7.95 1.09   7.30 2.87e-13 5.10e-10
## 3 isogroup14250_Past   42378.           7.83 1.12   6.97 3.13e-12 3.71e- 9
## 4 isogroup06096_Past  109607.           7.37 1.09   6.75 1.52e-11 1.35e- 8
## 5 isogroup09729_Past     487.           9.87 1.50   6.59 4.39e-11 3.12e- 8
## 6 isogroup00977_Past    9564.           6.01 0.921  6.52 6.91e-11 3.77e- 8
# save as RDS and CSV files
saveRDS(DE_OO_LO_padj.1, file = "Past_Host_DE_genes_OO_LO_padj.1.rds")
write.csv(DE_OO_LO_padj.1, "Past_Host_DE_genes_OO_LO_padj.1.csv")

Pairwise contrast: up- vs. down-regulated genes

up.results_OO_LO_padj.1 = row.names(results_OO_LO_.1[results_OO_LO_.1$padj<0.1 & !(is.na(results_OO_LO_.1$padj)) & results_OO_LO_.1$log2FoldChange>0,])

up.results_OO_LO_padj.1 <- as.data.frame(up.results_OO_LO_padj.1)
up.results_OO_LO_padj.1$DEG <- 'Up'

down.results_OO_LO_padj.1 = row.names(results_OO_LO_.1[results_OO_LO_.1$padj<0.1 & !(is.na(results_OO_LO_.1$padj)) & results_OO_LO_.1$log2FoldChange<0,])

down.results_OO_LO_padj.1 <- as.data.frame(down.results_OO_LO_padj.1)
down.results_OO_LO_padj.1$DEG <- 'Down'

OO_LO_padj.1_summary <- merge(up.results_OO_LO_padj.1, down.results_OO_LO_padj.1, all=T)
dim(OO_LO_padj.1_summary)
## [1] 33  3
# save file
write.csv(OO_LO_padj.1_summary, file="Past_Host_DE_genes_OO_LO_padj.1_summary.csv")

arrange by lowest p-values

This is later used as input for PERMANOVA

DE_OO_LO_pvalue <- res_OO_LO_.1_df %>%
  rownames_to_column('gene') %>% 
  as_tibble() %>%
  filter(pvalue < 0.05) %>% 
    arrange(pvalue)

head(DE_OO_LO_pvalue)
## # A tibble: 6 x 7
##   gene               baseMean log2FoldChange lfcSE  stat   pvalue     padj
##   <chr>                 <dbl>          <dbl> <dbl> <dbl>    <dbl>    <dbl>
## 1 isogroup20691_Past     492.           8.61 1.17   7.35 1.98e-13 5.10e-10
## 2 isogroup23261_Past    4472.           7.95 1.09   7.30 2.87e-13 5.10e-10
## 3 isogroup14250_Past   42378.           7.83 1.12   6.97 3.13e-12 3.71e- 9
## 4 isogroup06096_Past  109607.           7.37 1.09   6.75 1.52e-11 1.35e- 8
## 5 isogroup09729_Past     487.           9.87 1.50   6.59 4.39e-11 3.12e- 8
## 6 isogroup00977_Past    9564.           6.01 0.921  6.52 6.91e-11 3.77e- 8
# save as RDS and CSV files
saveRDS(DE_OO_LO_pvalue, file = "Past_Host_OO_LO_pval.05.rds")
write.csv(DE_OO_LO_pvalue, "Past_Host_OO_LO_pval.05.csv")
dim(DE_OO_LO_pvalue)
## [1] 284   7

“Ojo.Ojo”, “Ojo.Control”

p-adj < 0.1

results_OO_OC_.1 <- results(dds, contrast = c("group", "Ojo.Ojo", "Ojo.Control"), alpha = 0.1)
head(results_OO_OC_.1)
## log2 fold change (MLE): group Ojo.Ojo vs Ojo.Control 
## Wald test p-value: group Ojo.Ojo vs Ojo.Control 
## DataFrame with 6 rows and 6 columns
##                      baseMean log2FoldChange     lfcSE       stat    pvalue
##                     <numeric>      <numeric> <numeric>  <numeric> <numeric>
## GCKDGN101A08CL_Past   3.25973     -0.0226067  0.587343 -0.0384898  0.969297
## GCKDGN101A0M5U_Past   8.41096     -0.6712426  0.536637 -1.2508307  0.210996
## GCKDGN101A2BO2_Past   3.73717     -0.1840043  0.535949 -0.3433243  0.731355
## GCKDGN101A2NQR_Past   4.55599     -0.2047477  0.505031 -0.4054160  0.685172
## GCKDGN101A2V3C_Past   4.47120      1.0191806  0.833302  1.2230628  0.221306
## GCKDGN101A32G8_Past   3.39789      0.6135298  0.793720  0.7729803  0.439534
##                          padj
##                     <numeric>
## GCKDGN101A08CL_Past        NA
## GCKDGN101A0M5U_Past  0.458441
## GCKDGN101A2BO2_Past        NA
## GCKDGN101A2NQR_Past        NA
## GCKDGN101A2V3C_Past        NA
## GCKDGN101A32G8_Past        NA
summary(results_OO_OC_.1)
## 
## out of 8118 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 127, 1.6%
## LFC < 0 (down)     : 215, 2.6%
## outliers [1]       : 3, 0.037%
## low counts [2]     : 4719, 58%
## (mean count < 6)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
# convert to dataframe (from DESeq2 object)
res_OO_OC_.1_df <- data.frame(results_OO_OC_.1)

DE_OO_OC_padj.1 <- res_OO_OC_.1_df %>% 
  rownames_to_column('gene') %>% 
  as_tibble() %>%
  filter(padj < 0.1) %>% 
    arrange(padj)

head(DE_OO_OC_padj.1)
## # A tibble: 6 x 7
##   gene               baseMean log2FoldChange lfcSE  stat   pvalue     padj
##   <chr>                 <dbl>          <dbl> <dbl> <dbl>    <dbl>    <dbl>
## 1 isogroup14250_Past   42378.          10.6  0.993 10.6  2.15e-26 7.29e-23
## 2 isogroup23261_Past    4472.           9.77 0.966 10.1  4.83e-24 8.20e-21
## 3 isogroup06096_Past  109607.           9.56 0.965  9.90 4.06e-23 4.59e-20
## 4 isogroup30333_Past   40951.           9.38 0.955  9.82 9.12e-23 7.75e-20
## 5 isogroup20691_Past     492.          10.1  1.06   9.54 1.41e-21 9.59e-19
## 6 isogroup28947_Past     996.           9.28 1.01   9.21 3.23e-20 1.83e-17
# save as RDS and CSV files
saveRDS(DE_OO_OC_padj.1, file = "Past_Host_DE_genes_OO_OC_padj.1.rds")
write.csv(DE_OO_OC_padj.1, "Past_Host_DE_genes_OO_OC_padj.1.csv")

Pairwise contrast: up- vs. down-regulated genes

up.results_OO_OC_padj.1 = row.names(results_OO_OC_.1[results_OO_OC_.1$padj<0.1 & !(is.na(results_OO_OC_.1$padj)) & results_OO_OC_.1$log2FoldChange>0,])

up.results_OO_OC_padj.1 <- as.data.frame(up.results_OO_OC_padj.1)
up.results_OO_OC_padj.1$DEG <- 'Up'

down.results_OO_OC_padj.1 = row.names(results_OO_OC_.1[results_OO_OC_.1$padj<0.1 & !(is.na(results_OO_OC_.1$padj)) & results_OO_OC_.1$log2FoldChange<0,])

down.results_OO_OC_padj.1 <- as.data.frame(down.results_OO_OC_padj.1)
down.results_OO_OC_padj.1$DEG <- 'Down'

OO_OC_padj.1_summary <- merge(up.results_OO_OC_padj.1, down.results_OO_OC_padj.1, all=T)
dim(OO_OC_padj.1_summary)
## [1] 342   3
# save file
write.csv(OO_OC_padj.1_summary, file="Past_Host_DE_genes_OO_OC_padj.1_summary.csv")

arrange by lowest p-values

This is later used as input for PERMANOVA

DE_OO_OC_pvalue <- res_OO_OC_.1_df %>%
  rownames_to_column('gene') %>% 
  as_tibble() %>%
  filter(pvalue < 0.05) %>% 
    arrange(pvalue)

head(DE_OO_OC_pvalue)
## # A tibble: 6 x 7
##   gene               baseMean log2FoldChange lfcSE  stat   pvalue     padj
##   <chr>                 <dbl>          <dbl> <dbl> <dbl>    <dbl>    <dbl>
## 1 isogroup14250_Past   42378.          10.6  0.993 10.6  2.15e-26 7.29e-23
## 2 isogroup23261_Past    4472.           9.77 0.966 10.1  4.83e-24 8.20e-21
## 3 isogroup06096_Past  109607.           9.56 0.965  9.90 4.06e-23 4.59e-20
## 4 isogroup30333_Past   40951.           9.38 0.955  9.82 9.12e-23 7.75e-20
## 5 isogroup20691_Past     492.          10.1  1.06   9.54 1.41e-21 9.59e-19
## 6 isogroup28947_Past     996.           9.28 1.01   9.21 3.23e-20 1.83e-17
# save as RDS and CSV files
saveRDS(DE_OO_OC_pvalue, file = "Past_Host_OO_OC_pval.05.rds")
write.csv(DE_OO_OC_pvalue, "Past_Host_OO_OC_pval.05.csv")
dim(DE_OO_OC_pvalue)
## [1] 1142    7

“Ojo.Control”, “Lagoon.Control”

p-adj < 0.1

results_OC_LC_.1 <- results(dds, contrast = c("group", "Ojo.Control", "Lagoon.Control"), alpha = 0.1)
head(results_OC_LC_.1)
## log2 fold change (MLE): group Ojo.Control vs Lagoon.Control 
## Wald test p-value: group Ojo.Control vs Lagoon.Control 
## DataFrame with 6 rows and 6 columns
##                      baseMean log2FoldChange     lfcSE      stat    pvalue
##                     <numeric>      <numeric> <numeric> <numeric> <numeric>
## GCKDGN101A08CL_Past   3.25973      -0.812744  0.431810 -1.882177  0.059812
## GCKDGN101A0M5U_Past   8.41096      -0.248120  0.417782 -0.593899  0.552580
## GCKDGN101A2BO2_Past   3.73717       0.194367  0.422952  0.459549  0.645840
## GCKDGN101A2NQR_Past   4.55599      -0.343224  0.380228 -0.902681  0.366695
## GCKDGN101A2V3C_Past   4.47120      -0.541619  0.732531 -0.739381  0.459676
## GCKDGN101A32G8_Past   3.39789      -0.901434  0.668057 -1.349337  0.177229
##                          padj
##                     <numeric>
## GCKDGN101A08CL_Past  0.867038
## GCKDGN101A0M5U_Past  0.977203
## GCKDGN101A2BO2_Past  0.984412
## GCKDGN101A2NQR_Past  0.961032
## GCKDGN101A2V3C_Past  0.967736
## GCKDGN101A32G8_Past  0.909221
summary(results_OC_LC_.1)
## 
## out of 8118 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 1, 0.012%
## LFC < 0 (down)     : 0, 0%
## outliers [1]       : 3, 0.037%
## low counts [2]     : 0, 0%
## (mean count < 0)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
# convert to dataframe (from DESeq2 object)
res_OC_LC_.1_df <- data.frame(results_OC_LC_.1)

DE_OC_LC_padj.1 <- res_OC_LC_.1_df %>% 
  rownames_to_column('gene') %>% 
  as_tibble() %>%
  filter(padj < 0.1) %>% 
    arrange(padj)

head(DE_OC_LC_padj.1)
## # A tibble: 1 x 7
##   gene               baseMean log2FoldChange lfcSE  stat      pvalue    padj
##   <chr>                 <dbl>          <dbl> <dbl> <dbl>       <dbl>   <dbl>
## 1 isogroup07701_Past     967.           2.60 0.526  4.93 0.000000802 0.00651
# save as RDS and CSV files
saveRDS(DE_OC_LC_padj.1, file = "Past_Host_DE_genes_OC_LC_padj.1.rds")
write.csv(DE_OC_LC_padj.1, "Past_Host_DE_genes_OC_LC_padj.1.csv")

Pairwise contrast: up- vs. down-regulated genes

# up-regulated genes: padj < 0.1 (excluding NA padj) and log2FoldChange > 0
up.results_OC_LC_padj.1 <- row.names(results_OC_LC_.1[results_OC_LC_.1$padj < 0.1 & !is.na(results_OC_LC_.1$padj) & results_OC_LC_.1$log2FoldChange > 0, ])

up.results_OC_LC_padj.1 <- as.data.frame(up.results_OC_LC_padj.1)
up.results_OC_LC_padj.1$DEG <- 'Up'

dim(up.results_OC_LC_padj.1)
## [1] 1 2
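The `!(is.na(...))` guard in these subsets matters because DESeq2 reports `padj` as `NA` for count outliers and for genes removed by independent filtering (visible in the `head()` output of the next contrast). A toy table with hypothetical values shows the pattern:

```r
# toy results table; the NA padj mimics a gene removed by outlier/low-count filtering
res <- data.frame(
  padj           = c(0.01, NA, 0.2, 0.05),
  log2FoldChange = c(2.1, -1.0, 0.5, -3.2),
  row.names      = paste0("gene", 1:4)
)

keep <- !is.na(res$padj) & res$padj < 0.1             # drop NA padj first
up   <- rownames(res)[keep & res$log2FoldChange > 0]  # "gene1"
down <- rownames(res)[keep & res$log2FoldChange < 0]  # "gene4"
```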
# only one gene (up-regulated) passed padj < 0.1 in this contrast, so the
# down-regulated subset and the up/down merge are skipped:
#down.results_OC_LC_padj.1 <- row.names(results_OC_LC_.1[results_OC_LC_.1$padj<0.1 & !(is.na(results_OC_LC_.1$padj)) & results_OC_LC_.1$log2FoldChange<0,])

#down.results_OC_LC_padj.1 <- as.data.frame(down.results_OC_LC_padj.1)
#down.results_OC_LC_padj.1$DEG <- 'Down'

#OC_LC_padj.1_summary <- merge(up.results_OC_LC_padj.1, down.results_OC_LC_padj.1, all=T)
#dim(OC_LC_padj.1_summary)

# save file
write.csv(up.results_OC_LC_padj.1, file="Past_Host_DE_genes_OC_LC_padj.1_summary.csv")

Arrange by lowest p-values

This gene list is later used as input for the PERMANOVA.

DE_OC_LC_pvalue <- res_OC_LC_.1_df %>%
  rownames_to_column('gene') %>% 
  as_tibble() %>%
  filter(pvalue < 0.05) %>% 
    arrange(pvalue)

head(DE_OC_LC_pvalue)
## # A tibble: 6 x 7
##   gene               baseMean log2FoldChange lfcSE  stat      pvalue    padj
##   <chr>                 <dbl>          <dbl> <dbl> <dbl>       <dbl>   <dbl>
## 1 isogroup07701_Past   967.             2.60 0.526  4.93 0.000000802 0.00651
## 2 isogroup04988_Past     8.99           1.97 0.523  3.77 0.000164    0.450  
## 3 isogroup06150_Past     4.88          -2.75 0.739 -3.72 0.000198    0.450  
## 4 isogroup12584_Past     3.85           1.91 0.522  3.65 0.000257    0.450  
## 5 isogroup10149_Past     5.46           1.26 0.356  3.55 0.000379    0.450  
## 6 isogroup04577_Past     4.46           1.35 0.387  3.50 0.000466    0.450
# save as RDS and CSV files
saveRDS(DE_OC_LC_pvalue, file = "Past_Host_OC_LC_pval.05.rds")
write.csv(DE_OC_LC_pvalue, "Past_Host_OC_LC_pval.05.csv")
dim(DE_OC_LC_pvalue)
## [1] 456   7

Low pH vs. ambient pH, but not at comparable sites

“Ojo.Ojo”, “Lagoon.Control”

p-adj < 0.1

results_OO_LC_.1 <- results(dds, contrast = c("group", "Ojo.Ojo", "Lagoon.Control"), alpha = 0.1)
head(results_OO_LC_.1)
## log2 fold change (MLE): group Ojo.Ojo vs Lagoon.Control 
## Wald test p-value: group Ojo.Ojo vs Lagoon.Control 
## DataFrame with 6 rows and 6 columns
##                      baseMean log2FoldChange     lfcSE       stat    pvalue
##                     <numeric>      <numeric> <numeric>  <numeric> <numeric>
## GCKDGN101A08CL_Past   3.25973     -0.8353504  0.550043 -1.5186996 0.1288381
## GCKDGN101A0M5U_Past   8.41096     -0.9193626  0.523007 -1.7578391 0.0787749
## GCKDGN101A2BO2_Past   3.73717      0.0103628  0.529103  0.0195856 0.9843740
## GCKDGN101A2NQR_Past   4.55599     -0.5479719  0.487032 -1.1251244 0.2605363
## GCKDGN101A2V3C_Past   4.47120      0.4775614  0.802212  0.5953060 0.5516390
## GCKDGN101A32G8_Past   3.39789     -0.2879045  0.754891 -0.3813856 0.7029171
##                          padj
##                     <numeric>
## GCKDGN101A08CL_Past        NA
## GCKDGN101A0M5U_Past        NA
## GCKDGN101A2BO2_Past        NA
## GCKDGN101A2NQR_Past        NA
## GCKDGN101A2V3C_Past        NA
## GCKDGN101A32G8_Past        NA
summary(results_OO_LC_.1)
## 
## out of 8118 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 92, 1.1%
## LFC < 0 (down)     : 104, 1.3%
## outliers [1]       : 3, 0.037%
## low counts [2]     : 6450, 79%
## (mean count < 10)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
# convert to dataframe (from DESeq2 object)
res_OO_LC_.1_df <- data.frame(results_OO_LC_.1)

DE_OO_LC_padj.1 <- res_OO_LC_.1_df %>% 
  rownames_to_column('gene') %>% 
  as_tibble() %>%
  filter(padj < 0.1) %>% 
    arrange(padj)

head(DE_OO_LC_padj.1)
## # A tibble: 6 x 7
##   gene               baseMean log2FoldChange lfcSE  stat   pvalue     padj
##   <chr>                 <dbl>          <dbl> <dbl> <dbl>    <dbl>    <dbl>
## 1 isogroup14250_Past   42378.          10.4  0.967  10.8 3.71e-27 6.18e-24
## 2 isogroup23261_Past    4472.          10.1  0.941  10.7 1.06e-26 8.80e-24
## 3 isogroup28947_Past     996.          10.5  0.996  10.6 3.89e-26 1.81e-23
## 4 isogroup30333_Past   40951.           9.83 0.930  10.6 4.35e-26 1.81e-23
## 5 isogroup06096_Past  109607.           9.91 0.940  10.5 5.73e-26 1.91e-23
## 6 isogroup28285_Past    4405.          10.2  0.981  10.4 2.77e-25 7.69e-23
# save as RDS and CSV files
saveRDS(DE_OO_LC_padj.1, file = "Past_Host_DE_genes_OO_LC_padj.1.rds")
write.csv(DE_OO_LC_padj.1, "Past_Host_DE_genes_OO_LC_padj.1.csv")

Pairwise contrast: up- vs. down-regulated genes

# up-regulated genes: padj < 0.1 (excluding NA padj) and log2FoldChange > 0
up.results_OO_LC_padj.1 <- row.names(results_OO_LC_.1[results_OO_LC_.1$padj < 0.1 & !is.na(results_OO_LC_.1$padj) & results_OO_LC_.1$log2FoldChange > 0, ])

up.results_OO_LC_padj.1 <- as.data.frame(up.results_OO_LC_padj.1)
up.results_OO_LC_padj.1$DEG <- 'Up'

# down-regulated genes: padj < 0.1 (excluding NA padj) and log2FoldChange < 0
down.results_OO_LC_padj.1 <- row.names(results_OO_LC_.1[results_OO_LC_.1$padj < 0.1 & !is.na(results_OO_LC_.1$padj) & results_OO_LC_.1$log2FoldChange < 0, ])

down.results_OO_LC_padj.1 <- as.data.frame(down.results_OO_LC_padj.1)
down.results_OO_LC_padj.1$DEG <- 'Down'

OO_LC_padj.1_summary <- merge(up.results_OO_LC_padj.1, down.results_OO_LC_padj.1, all=T)
dim(OO_LC_padj.1_summary)
## [1] 196   3
# save file
write.csv(OO_LC_padj.1_summary, file="Past_Host_DE_genes_OO_LC_padj.1_summary.csv")

Arrange by lowest p-values

This gene list is later used as input for the PERMANOVA.

DE_OO_LC_pvalue <- res_OO_LC_.1_df %>%
  rownames_to_column('gene') %>% 
  as_tibble() %>%
  filter(pvalue < 0.05) %>% 
    arrange(pvalue)

head(DE_OO_LC_pvalue)
## # A tibble: 6 x 7
##   gene               baseMean log2FoldChange lfcSE  stat   pvalue     padj
##   <chr>                 <dbl>          <dbl> <dbl> <dbl>    <dbl>    <dbl>
## 1 isogroup14250_Past   42378.          10.4  0.967  10.8 3.71e-27 6.18e-24
## 2 isogroup23261_Past    4472.          10.1  0.941  10.7 1.06e-26 8.80e-24
## 3 isogroup28947_Past     996.          10.5  0.996  10.6 3.89e-26 1.81e-23
## 4 isogroup30333_Past   40951.           9.83 0.930  10.6 4.35e-26 1.81e-23
## 5 isogroup06096_Past  109607.           9.91 0.940  10.5 5.73e-26 1.91e-23
## 6 isogroup28285_Past    4405.          10.2  0.981  10.4 2.77e-25 7.69e-23
# save as RDS and CSV files
saveRDS(DE_OO_LC_pvalue, file = "Past_Host_OO_LC_pval.05.rds")
write.csv(DE_OO_LC_pvalue, "Past_Host_OO_LC_pval.05.csv")
dim(DE_OO_LC_pvalue)
## [1] 851   7
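The drop from 851 genes at raw p < 0.05 to 196 at padj < 0.1 reflects the Benjamini-Hochberg correction that produces the padj column; base R's `p.adjust()` applies the same adjustment (toy p-values for illustration):

```r
p <- c(0.001, 0.01, 0.03, 0.04, 0.2)
# BH-adjusted values: each raw p is scaled by n/rank, then made monotone
p.adjust(p, method = "BH")
# -> 0.005 0.025 0.050 0.050 0.200
```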

# column descriptions for the results table
mcols(results_OO_LC_.1)$description
## [1] "mean of normalized counts for all samples"              
## [2] "log2 fold change (MLE): group Ojo.Ojo vs Lagoon.Control"
## [3] "standard error: group Ojo.Ojo vs Lagoon.Control"        
## [4] "Wald statistic: group Ojo.Ojo vs Lagoon.Control"        
## [5] "Wald test p-value: group Ojo.Ojo vs Lagoon.Control"     
## [6] "BH adjusted p-values"

Session Info

sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS High Sierra 10.13.4
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
## 
## attached base packages:
##  [1] stats4    parallel  grid      stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] affy_1.66.0                 locfit_1.5-9.4             
##  [3] arrayQualityMetrics_3.44.0  DEGreport_1.24.1           
##  [5] DESeq2_1.28.1               SummarizedExperiment_1.18.2
##  [7] DelayedArray_0.14.1         matrixStats_0.58.0         
##  [9] GenomicRanges_1.40.0        GenomeInfoDb_1.24.2        
## [11] IRanges_2.22.2              S4Vectors_0.26.1           
## [13] BiocManager_1.30.10         data.table_1.14.0          
## [15] gridExtra_2.3               knitr_1.31                 
## [17] plyr_1.8.6                  vsn_3.56.0                 
## [19] Biobase_2.48.0              BiocGenerics_0.34.0        
## [21] devtools_2.3.2              usethis_2.0.1              
## [23] ggrepel_0.9.1               cowplot_1.1.1              
## [25] vegan_2.5-7                 lattice_0.20-41            
## [27] permute_0.9-5               forcats_0.5.1              
## [29] stringr_1.4.0               dplyr_1.0.4                
## [31] purrr_0.3.4                 readr_1.4.0                
## [33] tidyr_1.1.2                 tibble_3.1.0               
## [35] ggplot2_3.3.3               tidyverse_1.3.0            
## [37] ComplexHeatmap_2.4.3        gplots_3.1.1               
## [39] pheatmap_1.0.12             circlize_0.4.12            
## [41] RColorBrewer_1.1-2         
## 
## loaded via a namespace (and not attached):
##   [1] utf8_1.1.4                  tidyselect_1.1.0           
##   [3] beadarray_2.38.0            htmlwidgets_1.5.3          
##   [5] RSQLite_2.2.3               AnnotationDbi_1.50.3       
##   [7] BiocParallel_1.22.0         munsell_0.5.0              
##   [9] codetools_0.2-18            preprocessCore_1.50.0      
##  [11] withr_2.4.1                 colorspace_2.0-0           
##  [13] highr_0.8                   rstudioapi_0.13            
##  [15] setRNG_2013.9-1             labeling_0.4.2             
##  [17] lasso2_1.2-21.1             GenomeInfoDbData_1.2.3     
##  [19] hwriter_1.3.2               mnormt_2.0.2               
##  [21] farver_2.0.3                bit64_4.0.5                
##  [23] rprojroot_2.0.2             vctrs_0.3.6                
##  [25] generics_0.1.0              xfun_0.21                  
##  [27] R6_2.5.0                    illuminaio_0.30.0          
##  [29] clue_0.3-58                 gridSVG_1.7-2              
##  [31] bitops_1.0-6                cachem_1.0.4               
##  [33] reshape_0.8.8               assertthat_0.2.1           
##  [35] scales_1.1.1                nnet_7.3-15                
##  [37] gtable_0.3.0                processx_3.4.5             
##  [39] rlang_0.4.10                genefilter_1.70.0          
##  [41] systemfonts_1.0.1           GlobalOptions_0.1.2        
##  [43] splines_4.0.3               hexbin_1.28.2              
##  [45] checkmate_2.0.0             broom_0.7.5                
##  [47] reshape2_1.4.4              yaml_2.2.1                 
##  [49] modelr_0.1.8                backports_1.2.1            
##  [51] Hmisc_4.4-2                 tools_4.0.3                
##  [53] psych_2.0.12                logging_0.10-108           
##  [55] affyio_1.58.0               ellipsis_0.3.1             
##  [57] jquerylib_0.1.3             ggdendro_0.1.22            
##  [59] sessioninfo_1.1.1           Rcpp_1.0.6                 
##  [61] base64enc_0.1-3             zlibbioc_1.34.0            
##  [63] RCurl_1.98-1.2              ps_1.5.0                   
##  [65] prettyunits_1.1.1           openssl_1.4.3              
##  [67] rpart_4.1-15                GetoptLong_1.0.5           
##  [69] haven_2.3.1                 cluster_2.1.1              
##  [71] fs_1.5.0                    magrittr_2.0.1             
##  [73] reprex_1.0.0                tmvnsim_1.0-2              
##  [75] pkgload_1.2.0               hms_1.0.0                  
##  [77] evaluate_0.14               xtable_1.8-4               
##  [79] XML_3.99-0.5                jpeg_0.1-8.1               
##  [81] gcrma_2.60.0                readxl_1.3.1               
##  [83] shape_1.4.5                 testthat_3.0.2             
##  [85] compiler_4.0.3              KernSmooth_2.23-18         
##  [87] crayon_1.4.1                htmltools_0.5.1.1          
##  [89] mgcv_1.8-34                 Formula_1.2-4              
##  [91] geneplotter_1.66.0          lubridate_1.7.10           
##  [93] DBI_1.1.1                   dbplyr_2.1.0               
##  [95] MASS_7.3-53.1               Matrix_1.3-2               
##  [97] cli_2.3.1                   pkgconfig_2.0.3            
##  [99] foreign_0.8-81              xml2_1.3.2                 
## [101] svglite_1.2.3.2             annotate_1.66.0            
## [103] BeadDataPackR_1.40.0        bslib_0.2.4                
## [105] affyPLM_1.64.0              XVector_0.28.0             
## [107] rvest_0.3.6                 callr_3.5.1                
## [109] digest_0.6.27               ConsensusClusterPlus_1.52.0
## [111] Biostrings_2.56.0           base64_2.0                 
## [113] rmarkdown_2.7               cellranger_1.1.0           
## [115] htmlTable_2.1.0             edgeR_3.30.3               
## [117] gdtools_0.2.3               gtools_3.8.2               
## [119] rjson_0.2.20                lifecycle_1.0.0            
## [121] nlme_3.1-152                jsonlite_1.7.2             
## [123] askpass_1.1                 desc_1.2.0                 
## [125] limma_3.44.3                fansi_0.4.2                
## [127] pillar_1.5.0                Nozzle.R1_1.1-1            
## [129] fastmap_1.1.0               httr_1.4.2                 
## [131] pkgbuild_1.2.0              survival_3.2-7             
## [133] glue_1.4.2                  remotes_2.2.0              
## [135] png_0.1-7                   bit_4.0.4                  
## [137] stringi_1.5.3               sass_0.3.1                 
## [139] blob_1.2.1                  latticeExtra_0.6-29        
## [141] caTools_1.18.1              memoise_2.0.0